{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
    "\n",
    "<i>Licensed under the MIT License.</i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NPA: Neural News Recommendation with Personalized Attention\n",
    "NPA \\[1\\] is a news recommendation model with personalized attention. The core of NPA is a news representation model and a user representation model. In the news representation model we use a CNN network to learn hidden representations of news articles based on their titles. In the user representation model we learn the representations of users based on the representations of their clicked news articles. In addition, a word-level and a news-level personalized attention are used to capture different informativeness for different users.\n",
    "\n",
    "## Properties of NPA:\n",
    "- NPA is a content-based news recommendation method.\n",
    "- It uses a CNN network to learn news representation. And it learns user representations from their clicked news articles.\n",
    "- A word-level personalized attention is used to help NPA attend to important words for different users.\n",
    "- A news-level personalized attention is used to help NPA attend to important historical clicked news for different users.\n",
    "\n",
    "## Data format:\n",
    "\n",
    "### train data\n",
    "One simple example: <br>\n",
    "\n",
    "`1 0 0 0 0 Impression:0 User:2903 CandidateNews0:27006,11901,21668,9856,16156,21390,1741,2003,16983,8164 CandidateNews1:8377,10423,9960,5485,20494,7553,1251,17232,4745,9178 CandidateNews2:1607,26414,25830,16156,15337,16461,4004,6230,17841,10704 CandidateNews3:17323,20324,27855,16156,2934,14673,551,0,0,0 CandidateNews4:7172,3596,25442,21596,26195,4745,17988,16461,1741,76 ClickedNews0:11362,8205,22501,9349,12911,20324,1238,11362,26422,19185 ...`\n",
    "<br>\n",
    "\n",
    "In general, each line in data file represents one positive instance and n negative instances in a same impression. The format is like: <br>\n",
    "\n",
    "`[label0] ... [labeln] [Impression:i] [User:u] [CandidateNews0:w1,w2,w3,...] ... [CandidateNewsn:w1,w2,w3,...] [ClickedNews0:w1,w2,w3,...] ...`\n",
    "\n",
    "<br>\n",
    "\n",
    "It contains several parts seperated by space, i.e. label part, Impression part `<impresison id>`, User part `<user id>`, CandidateNews part, ClickedHistory part. CandidateNews part describes the target news article we are going to score in this instance, it is represented by (aligned) title words. To take a quick example, a news title may be : `Trump to deliver State of the Union address next week` , then the title words value may be `CandidateNewsi:34,45,334,23,12,987,3456,111,456,432`. <br>\n",
    "ClickedNewsk describe the k-th news article the user ever clicked and the format is the same as candidate news. Words are aligned in news title. We use a fixed length to describe an article, if the title is less than the fixed length, just pad it with zeros.\n",
    "\n",
    "### test data\n",
    "One simple example: <br>\n",
    "`1 Impression:0 User:6446 CandidateNews0:18707,23848,13490,10948,21385,11606,1251,16591,827,28081 ClickedNews0:27838,7376,16567,28518,119,21248,7598,9349,20324,9349 ClickedNews1:7969,9783,1741,2549,27104,14669,14777,21343,7667,20324 ...`\n",
    "<br>\n",
    "\n",
    "In general, each line in data file represents one instance. The format is like: <br>\n",
    "\n",
    "`[label] [Impression:i] [User:u] [CandidateNews0:w1,w2,w3,...] [ClickedNews0:w1,w2,w3,...] ...`\n",
    "<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Global settings and imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System version: 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21) \n",
      "[GCC 7.3.0]\n",
      "Tensorflow version: 1.15.2\n"
     ]
    }
   ],
   "source": [
    "import sys\n",
    "sys.path.append(\"../../\")\n",
    "from reco_utils.recommender.deeprec.deeprec_utils import download_deeprec_resources \n",
    "from reco_utils.recommender.newsrec.newsrec_utils import prepare_hparams\n",
    "from reco_utils.recommender.newsrec.models.npa import NPAModel\n",
    "from reco_utils.recommender.newsrec.io.news_iterator import NewsIterator\n",
    "import papermill as pm\n",
    "from tempfile import TemporaryDirectory\n",
    "import tensorflow as tf\n",
    "import os\n",
    "\n",
    "print(\"System version: {}\".format(sys.version))\n",
    "print(\"Tensorflow version: {}\".format(tf.__version__))\n",
    "\n",
    "tmpdir = TemporaryDirectory()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download and load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 21.2k/21.2k [00:01<00:00, 14.0kKB/s]\n"
     ]
    }
   ],
   "source": [
    "data_path = tmpdir.name\n",
    "yaml_file = os.path.join(data_path, r'npa.yaml')\n",
    "train_file = os.path.join(data_path, r'train.txt')\n",
    "valid_file = os.path.join(data_path, r'test.txt')\n",
    "wordEmb_file = os.path.join(data_path, r'embedding.npy')\n",
    "if not os.path.exists(yaml_file):\n",
    "    download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/newsrec/', data_path, 'npa.zip')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create hyper-parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "epochs=5\n",
    "seed=42"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:\n",
      "The TensorFlow contrib module will not be included in TensorFlow 2.0.\n",
      "For more information, please see:\n",
      "  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md\n",
      "  * https://github.com/tensorflow/addons\n",
      "  * https://github.com/tensorflow/io (for I/O related ops)\n",
      "If you depend on functionality not listed there, please file an issue.\n",
      "\n",
      "data_format=news,iterator_type=None,wordEmb_file=/tmp/tmph3zpuc6n/embedding.npy,doc_size=10,title_size=None,body_size=None,word_emb_dim=100,word_size=28929,user_num=10338,vert_num=None,subvert_num=None,his_size=50,npratio=4,dropout=0.2,attention_hidden_dim=200,head_num=4,head_dim=100,cnn_activation=relu,dense_activation=None,filter_num=400,window_size=3,vert_emb_dim=100,subvert_emb_dim=100,gru_unit=400,type=ini,user_emb_dim=50,learning_rate=0.0001,loss=cross_entropy_loss,optimizer=adam,epochs=5,batch_size=64,show_step=100000,metrics=['group_auc', 'mean_mrr', 'ndcg@5;10']\n"
     ]
    }
   ],
   "source": [
    "hparams = prepare_hparams(yaml_file, wordEmb_file=wordEmb_file, epochs=epochs)\n",
    "print(hparams)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "iterator = NewsIterator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the NPA model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From ../../reco_utils/recommender/newsrec/models/base_model.py:39: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.\n",
      "\n",
      "WARNING:tensorflow:From /data/anaconda/envs/reco_gpu/lib/python3.6/site-packages/tensorflow_core/python/keras/initializers.py:119: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
      "WARNING:tensorflow:From /data/anaconda/envs/reco_gpu/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "If using Keras pass *_constraint arguments to layers.\n",
      "WARNING:tensorflow:From ../../reco_utils/recommender/newsrec/models/base_model.py:54: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "model = NPAModel(hparams, iterator, seed=seed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From ../../reco_utils/recommender/newsrec/io/news_iterator.py:95: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.\n",
      "\n",
      "{'group_auc': 0.51, 'mean_mrr': 0.163, 'ndcg@5': 0.1522, 'ndcg@10': 0.2123}\n"
     ]
    }
   ],
   "source": [
    "print(model.run_eval(valid_file))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 1\n",
      "train info: logloss loss:1.6397553108176406\n",
      "eval info: group_auc:0.538, mean_mrr:0.1686, ndcg@10:0.2298, ndcg@5:0.1625\n",
      "at epoch 1 , train time: 13.9 eval time: 8.9\n",
      "at epoch 2\n",
      "train info: logloss loss:1.590539211156417\n",
      "eval info: group_auc:0.5392, mean_mrr:0.1737, ndcg@10:0.2335, ndcg@5:0.1673\n",
      "at epoch 2 , train time: 12.5 eval time: 8.7\n",
      "at epoch 3\n",
      "train info: logloss loss:1.5472524715929614\n",
      "eval info: group_auc:0.5517, mean_mrr:0.1739, ndcg@10:0.2385, ndcg@5:0.1659\n",
      "at epoch 3 , train time: 12.6 eval time: 8.8\n",
      "at epoch 4\n",
      "train info: logloss loss:1.4370532391022663\n",
      "eval info: group_auc:0.5575, mean_mrr:0.1741, ndcg@10:0.2411, ndcg@5:0.1676\n",
      "at epoch 4 , train time: 12.4 eval time: 8.9\n",
      "at epoch 5\n",
      "train info: logloss loss:1.3469764787323621\n",
      "eval info: group_auc:0.5609, mean_mrr:0.1783, ndcg@10:0.2486, ndcg@5:0.1697\n",
      "at epoch 5 , train time: 12.4 eval time: 8.7\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<reco_utils.recommender.newsrec.models.npa.NPAModel at 0x7fa6dc188358>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.fit(train_file, valid_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'group_auc': 0.5609, 'mean_mrr': 0.1783, 'ndcg@5': 0.1697, 'ndcg@10': 0.2486}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/data/anaconda/envs/reco_gpu/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: Function record is deprecated and will be removed in verison 1.0.0 (current version 0.19.1). Please see `scrapbook.glue` (nteract-scrapbook) as a replacement for this functionality.\n",
      "  This is separate from the ipykernel package so we can avoid doing imports until\n"
     ]
    },
    {
     "data": {
      "application/papermill.record+json": {
       "res_syn": {
        "group_auc": 0.5609,
        "mean_mrr": 0.1783,
        "ndcg@10": 0.2486,
        "ndcg@5": 0.1697
       }
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "res_syn = model.run_eval(valid_file)\n",
    "print(res_syn)\n",
    "pm.record(\"res_syn\", res_syn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference\n",
    "\\[1\\] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang and Xing Xie: NPA: Neural News Recommendation with Personalized Attention, KDD 2019, ADS track.<br>"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
